Chiral Bosonization of Superconformal Ghosts
We explain how the Hilbert space of the superconformal ghost (beta, gamma) system differs from that of its bosonized fields phi and chi. We calculate the chiral correlation functions of the phi and chi fields by inserting appropriate projectors.
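For reference, the standard Friedan-Martinec-Shenker bosonization that underlies the fields phi and chi reads (a textbook summary under standard conventions, not this paper's specific construction):

\beta = e^{-\phi}\,\partial\xi, \qquad \gamma = \eta\, e^{\phi}, \qquad \xi = e^{\chi}, \qquad \eta = e^{-\chi},
\phi(z)\,\phi(w) \sim -\ln(z - w), \qquad \chi(z)\,\chi(w) \sim \ln(z - w).

The Hilbert-space mismatch arises because the bosonized theory contains the zero mode of \xi, whose states are absent from the original (\beta, \gamma) Fock space, so chiral correlators of the two descriptions only agree after projecting onto the common subspace.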
Revisiting Discrete Soft Actor-Critic
We study the adaptation of soft actor-critic (SAC) from continuous action space
to discrete action space. We revisit vanilla SAC and provide an in-depth
understanding of its Q value underestimation and performance instability issues
when applied to discrete settings. We thereby propose entropy-penalty and
double average Q-learning with Q-clip to address these issues. Extensive
experiments on typical benchmarks with discrete action space, including Atari
games and a large-scale MOBA game, show the efficacy of our proposed method.
Our code is at: https://github.com/coldsummerday/Revisiting-Discrete-SAC
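As a rough sketch of the critic update the abstract describes, the following Python fragment combines double average Q-learning with a Q-clip; the clip radius clip_c, the trust-region form of the clip, and all tensor shapes are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn.functional as F

def discrete_sac_critic_loss(q1, q2, q1_next, q2_next, next_log_probs,
                             actions, rewards, dones,
                             gamma=0.99, alpha=0.2, clip_c=0.5):
    """Critic loss with double *average* Q-learning and a Q-clip (sketch).

    q1, q2:           [B, A] current critics' Q-values at s_t.
    q1_next, q2_next: [B, A] target critics' Q-values at s_{t+1}.
    next_log_probs:   [B, A] log pi(a | s_{t+1}) from the current policy.
    actions:          [B] int64 actions taken at s_t.
    """
    with torch.no_grad():
        probs = next_log_probs.exp()
        # Average the two target critics instead of taking their minimum,
        # countering the underestimation that min-based targets induce.
        q_avg = 0.5 * (q1_next + q2_next)
        # Soft state value, summed exactly over the discrete action set.
        v_next = (probs * (q_avg - alpha * next_log_probs)).sum(dim=-1)
        target = rewards + gamma * (1.0 - dones) * v_next

    q1_a = q1.gather(1, actions.unsqueeze(1)).squeeze(1)
    q2_a = q2.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Q-clip (assumed trust-region form): keep the regression target within
    # clip_c of each critic's current prediction to damp instability.
    t1 = torch.clamp(target, q1_a.detach() - clip_c, q1_a.detach() + clip_c)
    t2 = torch.clamp(target, q2_a.detach() - clip_c, q2_a.detach() + clip_c)
    return F.mse_loss(q1_a, t1) + F.mse_loss(q2_a, t2)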
Revisiting Estimation Bias in Policy Gradients for Deep Reinforcement Learning
We revisit the estimation bias in policy gradients for the discounted
episodic Markov decision process (MDP) from the Deep Reinforcement Learning (DRL)
perspective. The objective is formulated theoretically as the expected returns
discounted over the time horizon. One of the major policy gradient biases is
the state distribution shift: the state distribution used to estimate the
gradients differs from the theoretical formulation in that it does not take
into account the discount factor. Existing discussion of the influence of this
bias was limited to the tabular and softmax cases in the literature. Therefore,
in this paper, we extend it to the DRL setting where the policy is
parameterized, and theoretically demonstrate how this bias can lead to
suboptimal policies. We then discuss why implementations with a shifted
state distribution, though theoretically inaccurate, can still be effective in practice. We show that, despite
such state distribution shift, the policy gradient estimation bias can be
reduced in the following three ways: 1) a small learning rate; 2) an
adaptive-learning-rate-based optimizer; and 3) KL regularization. Specifically,
we show that a smaller learning rate, or an adaptive learning rate such as
that used by the Adam and RMSProp optimizers, makes the policy optimization robust
to the bias. We further draw connections between optimizers and the
optimization regularization to show that both the KL and the reverse KL
regularization can significantly rectify this bias. Moreover, we provide
extensive experiments on continuous control tasks to support our analysis. Our
paper sheds light on how successful PG algorithms optimize policies in the DRL
setting, and contributes insights into the practical issues in DRL.
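To make the bias concrete, the discounted objective and its exact gradient read, in standard policy gradient notation (a textbook statement of the mismatch, not the paper's full derivation):

J(\theta) = \mathbb{E}_{\pi_\theta}\!\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t\Big], \qquad
\nabla_\theta J(\theta) \propto \mathbb{E}_{s \sim d_{\gamma}^{\pi}}\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\big], \qquad
d_{\gamma}^{\pi}(s) \propto \sum_{t=0}^{\infty} \gamma^{t} \Pr(s_t = s).

Practical implementations sample states from the undiscounted visitation distribution d^{\pi} rather than d_{\gamma}^{\pi}; that substitution is exactly the state distribution shift analyzed above.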
RLTF: Reinforcement Learning from Unit Test Feedback
The goal of program synthesis, or code generation, is to generate executable
code based on given descriptions. Recently, there has been an increasing number
of studies employing reinforcement learning (RL) to improve the performance of
large language models (LLMs) for code. However, these RL methods have only used
offline frameworks, limiting their exploration of new sample spaces.
Additionally, current approaches that utilize unit test signals are rather
simple, not accounting for specific error locations within the code. To address
these issues, we propose RLTF, i.e., Reinforcement Learning from Unit Test
Feedback, a novel online RL framework with unit test feedback of
multi-granularity for refining code LLMs. Our approach generates data in
real-time during training and simultaneously utilizes fine-grained feedback
signals to guide the model towards producing higher-quality code. Extensive
experiments show that RLTF achieves state-of-the-art performance on the APPS
and the MBPP benchmarks. Our code can be found at:
https://github.com/Zyq-scut/RLTF
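A minimal Python sketch of multi-granularity unit-test feedback of the kind described above; the reward scale, status taxonomy, and error-location penalty are illustrative assumptions, not the released implementation.

def coarse_reward(status):
    """Program-level reward from a unit-test run (hypothetical scale)."""
    return {"pass": 1.0,        # all tests pass
            "fail": -0.3,       # runs but produces wrong output
            "error": -1.0}[status]  # runtime or compile error

def fine_grained_penalty(num_lines, error_line, weight=0.5):
    """Line-level signal: penalize only the reported error location, so
    learning concentrates on the offending span (the fine-grained feedback
    mentioned above) instead of the whole program."""
    penalty = [0.0] * num_lines
    if error_line is not None and 0 <= error_line < num_lines:
        penalty[error_line] = -weight
    return penalty

# Example: a sampled program of 5 lines that crashes on line 3.
reward = coarse_reward("error")              # -1.0
line_penalties = fine_grained_penalty(5, 3)  # [0.0, 0.0, 0.0, -0.5, 0.0]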
RLogist: Fast Observation Strategy on Whole-slide Images with Deep Reinforcement Learning
Whole-slide images (WSIs) in computational pathology have gigapixel
resolution but generally contain sparse regions of interest, which leads to
weak diagnostic relevance and data inefficiency for each area in the slide.
Most of the existing methods rely on a multiple instance learning framework
that requires densely sampling local patches at high magnification. The
limitation is evident in the application stage as the heavy computation for
extracting patch-level features is inevitable. In this paper, we develop
RLogist, a benchmarking deep reinforcement learning (DRL) method for fast
observation strategy on WSIs. Imitating the diagnostic logic of human
pathologists, our RL agent learns how to find regions of observation value and
obtain representative features across multiple resolution levels, without
having to analyze each part of the WSI at high magnification. We benchmark
our method on two whole-slide level classification tasks, including detection
of metastases in WSIs of lymph node sections, and subtyping of lung cancer.
Experimental results demonstrate that RLogist achieves competitive
classification performance compared to typical multiple instance learning
algorithms, while requiring a significantly shorter observation path. In addition,
the observation path given by RLogist provides good decision-making
interpretability, and its ability of reading path navigation can potentially be
used by pathologists for educational/assistive purposes. Our code is available
at: https://github.com/tencent-ailab/RLogist.
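A minimal, self-contained sketch of the coarse-to-fine observation loop the abstract describes; the toy environment, the random stand-in policy, and the context update are assumptions for illustration only.

import numpy as np

class ToySlideEnv:
    """Stand-in for a WSI reader: a grid of patch features, where zooming
    into a cell models an expensive high-magnification read."""
    def __init__(self, grid=8, dim=16, seed=0):
        self.rng = np.random.default_rng(seed)
        self.patches = self.rng.normal(size=(grid, grid, dim))
        self.grid = grid

    def low_res_view(self):
        return self.patches.mean(axis=(0, 1))   # cheap thumbnail summary

    def zoom(self, region):
        r, c = region
        return self.patches[r, c]                # costly high-mag features

def random_policy(context, grid=8, rng=np.random.default_rng(1)):
    stop = rng.random() < 0.2                    # placeholder stopping rule
    region = (int(rng.integers(grid)), int(rng.integers(grid)))
    return region, stop

def observe(env, policy, max_steps=8):
    """Visit a short path of regions instead of densely tiling the slide."""
    context, feats = env.low_res_view(), []
    for _ in range(max_steps):
        region, stop = policy(context)
        if stop:                                 # agent decides it has seen enough
            break
        patch = env.zoom(region)
        feats.append(patch)
        context = context + 0.1 * patch          # toy context/feature fusion
    return np.mean(feats, axis=0) if feats else context  # slide-level feature

A trained DRL policy would replace random_policy, choosing which region to inspect next by expected diagnostic value, which is what keeps the observation path short.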
LLM-Based Agent Society Investigation: Collaboration and Confrontation in Avalon Gameplay
This paper aims to investigate the open research problem of uncovering the
social behaviors of LLM-based agents. To achieve this goal, we adopt Avalon, a
representative communication game, as the environment and use system prompts to
guide LLM agents to play the game. While previous studies have conducted
preliminary investigations into gameplay with LLM agents, research on their
social behaviors remains scarce. In this paper, we present a novel framework designed
to seamlessly adapt to Avalon gameplay. The core of our proposed framework is a
multi-agent system that enables efficient communication and interaction among
agents. We evaluate the performance of our framework based on metrics from two
perspectives: winning the game and analyzing the social behaviors of LLM
agents. Our results demonstrate the effectiveness of our framework in
generating adaptive and intelligent agents and highlight the potential of
LLM-based agents in addressing the challenges associated with dynamic social
environment interaction. By analyzing the social behaviors of LLM agents from
the aspects of both collaboration and confrontation, we provide insights into
the research and applications of this domain.
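A minimal sketch of a system-prompt-driven multi-agent discussion loop of the kind such a framework implies; the roles, prompt wording, and the chat callable are assumptions, not the authors' code.

from dataclasses import dataclass, field

@dataclass
class AvalonAgent:
    name: str
    role: str                              # e.g. "Merlin", "Assassin", "Servant"
    memory: list = field(default_factory=list)

    def system_prompt(self):
        return (f"You are {self.name}, playing the hidden role {self.role} "
                "in Avalon. Pursue your side's objective; do not reveal "
                "hidden roles directly.")

    def speak(self, transcript, chat):
        # `chat` is an assumed LLM backend: (system_prompt, messages) -> str.
        reply = chat(self.system_prompt(), transcript + self.memory)
        self.memory.append(reply)          # private memory enables deception
        return reply

def discussion_round(agents, chat):
    """One table-talk round: each agent sees all earlier utterances."""
    transcript = []
    for agent in agents:
        utterance = agent.speak(list(transcript), chat)
        transcript.append(f"{agent.name}: {utterance}")
    return transcript

Collaboration and confrontation can then be measured on the transcripts, e.g. how often agents endorse teammates versus cast suspicion on opponents.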